System and method for automatic segmentation of ASR transcripts

ABSTRACT

Text segmentation based on topic boundary detection has been an industry problem in automating information dissemination to targeted users. A system for automatic segmentation of ASR output text involves boundary identification based on “topic” changes. The proposed approach is based on building a weighted graph to determine dependency in input sentences based on bi-directional analysis of the input sentences. Furthermore, the input sentences are segmented based on the notion of segment cohesiveness and the segmented sentences are merged based on preamble and postamble analyses.

FIELD OF THE INVENTION

The present invention relates to text analysis in general, and more particularly, text analysis of automated speech recognizer outputs. Still more particularly, the present invention relates to a system and method for analyzing input text to determine a dependency graph based on bidirectional analysis and to segment and merge input text based on the determined dependency graph.

BACKGROUND OF THE INVENTION

Identification of coherent sections of sentences is a form of text segmentation. Processing of output from an automatic speech recognition (ASR) system is a widely applicable scenario for such text segmentation. In a variety of applicable scenarios, the plain text does not contain any title or annotation to hint at the subtopics discussed. Further, there is a need to segment ASR transcripts to determine a group of sentences wherein such a group need not have temporal cohesiveness. Text segmentation has been widely applied in topic identification, text summarization, categorization, information retrieval and dissemination.

Consider a scenario of broadcast news packaging for registered users. The users' profiles provide information about the kind of news packages that need to be delivered to the various users. As news is broadcast, it is required to analyze the generated ASR transcripts, identify news segments, and combine multiple segments as a package of audio and video for delivery. Another scenario of interest is scene-based segmentation of a video. While it is interesting to determine scenes based on video analysis, such analysis is not completely error-free. To complement such an approach, it is useful to convert the associated audio to text form using an ASR system; the segmentation of the generated text could then assist in scene segmentation.

DESCRIPTION OF RELATED ART

U.S. Pat. No. 6,928,407 to Ponceleon; Dulce Beatriz (Palo Alto, Calif.), Srinivasan; Savitha (San Jose, Calif.) for “System and method for the automatic discovery of salient segments in speech transcripts” (issued on Aug. 9, 2005 and assigned to International Business Machines Corporation (Armonk, N.Y.)) describes a system and associated method to automatically discover salient segments in a speech transcript and focuses on the segmentation of an audio/video source into topically cohesive segments based on Automatic Speech Recognition (ASR) transcriptions using the word n-grams extracted from the speech transcript.

U.S. Pat. No. 6,772,120 to Moreno; Pedro J. (Cambridge, Mass.), Blei; David M. (Oakland, Calif.) for “Computer method and apparatus for segmenting text streams” (issued on Aug. 3, 2004 and assigned to Hewlett-Packard Development Company, L.P. (Houston, Tex.)) describes a computer method and apparatus for segmenting text streams based on computed probabilities associated with a group of words with respect to a topic selected from a set of predetermined topics.

U.S. Pat. No. 6,529,902 to Kanevsky; Dimitri (Ossining, N.Y.), Yashchin; Emmanuel (Yorktown Heights, N.Y.) for “Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling” (issued on Mar. 4, 2003 and assigned to International Business Machines Corporation (Armonk, N.Y.)) describes a system (and method) for off-line detection of textual topical changes that includes at least one central processing unit (CPU), at least one memory coupled to the at least one CPU, a network connectable to the at least one CPU, and a database, stored on the at least one memory, containing a plurality of textual data sets of topics. The CPU executes first and second processes in first and second directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios, comparing them to a threshold, and defining whether there is a change point at the current last word in a window.

U.S. Pat. No. 6,104,989 to Kanevsky; Dimitri (Ossining, N.Y.), Yashchin; Emmanuel (Yorktown Heights, N.Y.) for “Real time detection of topical changes and topic identification via likelihood based methods” (issued on Aug. 15, 2000 and assigned to International Business Machines Corporation (Armonk, N.Y.)) describes a method for detecting topical changes and topic identification in texts in real time using likelihood ratio based methods.

“LEXTER, a Natural Language Tool for Terminology Extraction” by Bourigault D., Gonzalez I., and Gros C. (appeared in Proceedings of the Seventh EURALEX International Congress, Goteborg, Sweden, 1996), describes the use of natural language processing to extract phrases by means of syntactical structures.

“Word Association Norms, Mutual Information and Lexicography” by Church, K. and Hanks, P. (appeared in Computational Linguistics, Volume 16, Number 1, 1991), describes the use of statistical occurrence measures for the purposes of phrase extraction.

“TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages” by Hearst, M. (appeared in Computational Linguistics, Volume 23, Number 1, 1997), “Advances in Domain Independent Linear Text Segmentation” by Choi F. (appeared in Proceedings of the North American Chapter of ACL, 2000), and “SeLeCT: A Lexical Cohesion Based News Story Segmentation System” by Stokes N., Carthy J., and Smeaton A. F. (appeared in Journal of AI Communications, Volume 17, Number 1, 2004) describe methodologies based on linguistic techniques such as lexical cohesion for text segmentation.

“Query Expansion Using Local and Global Document Analysis” by Xu J. and Croft W. B. (appeared in Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996), and “Text Segmentation by Topic” by Ponte J. M. and Croft W. B. (appeared in Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, 1997) describe approaches based on local context analysis.

“Segmenting Conversations by Topic, Initiative, and Style” by Ries K. (appeared in Proceedings of ACM SIGIR'01 Workshop on Information Retrieval Techniques for Speech Applications, Louisiana, 2001) describes the segmentation of speech recognizer transcripts based on speaker initiative and style to achieve topical segmentation.

“Automatic extraction of key sentences from oral presentations using statistical measure based on discourse markers” by Kitade T., Nanjo H., and Kawahara T. (appeared in Proceedings of International Conference on Spoken Language Processing (ICSLP), 2004) describes the use of discourse markers at the beginning of sections in presentations for detecting section boundaries.

“Domain-independent Text Segmentation Using Anisotropic Diffusion and Dynamic Programming” by Ji X. and Zha H. (appeared in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003) describes a domain-independent text segmentation method that identifies the boundaries of topic changes in long text documents and/or text streams based on an anisotropic diffusion technique applied to an image representation of the sentence-distance matrix.

“Minimum Cut Model for Spoken Lecture Segmentation” by Malioutov I. and Barzilay R. (appeared in Proceedings of the 21st International Conference on Computational Linguistics of the Association for Computational Linguistics, 2006) describes the task of unsupervised lecture segmentation and applies graph partitioning to identify topic sentences. The similarity computation presented is based on exponential cosine similarity.

The known systems do not address the various issues related to text segmentation, including the dependence on a lexicon for enforcing syntactic and semantic structures, the determination of inter-sentence relationships based on bi-directional (forward and reverse) analysis, the assessment of segment cohesiveness, and the merging of related segments. The present invention provides a system for addressing these issues in order to achieve more effective text segmentation.

SUMMARY OF THE INVENTION

The primary objective of the invention is to determine a plurality of sentence segments given an input text of a plurality of sentences.

One aspect of the present invention is to determine a weighted graph based on bi-directional analysis of a plurality of sentences.

Another aspect of the present invention is to determine cohesiveness of a sentence segment.

Yet another aspect of the present invention is to determine a plurality of sentence segments based on segment cohesiveness.

Another aspect of the present invention is to determine a plurality of preamble segments given a plurality of sentence segments.

Yet another aspect of the present invention is to determine a plurality of postamble segments given a plurality of sentence segments.

Another aspect of the invention is to merge a preamble segment with a sentence segment of a plurality of sentence segments.

Yet another aspect of the invention is to merge a postamble segment with a sentence segment of a plurality of sentence segments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an overview of the text segmentation system.

FIG. 2 depicts an illustrative input text.

FIG. 3 depicts an illustrative frequency matrix.

FIG. 4 provides an algorithm for dependency graph generation.

FIG. 5 depicts an illustrative dependency graph.

FIG. 6 provides an algorithm for cohesiveness based graph segmentation.

FIG. 6 a provides an algorithm for segment cohesiveness analysis.

FIG. 6 b provides an algorithm for segment grouping.

FIG. 7 depicts illustrative graph segments.

FIG. 8 depicts illustrative text segments.

FIG. 9 depicts illustrative merged text segments.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Text segmentation has been widely applied in topic identification, text summarization, categorization, information retrieval and dissemination. The plain text under consideration does not contain any title or annotation to hint at the subtopics discussed. It is assumed that sentences of the plain text are separated by periods; however, there are no paragraph demarcations. Each sentence needs to be parsed, in an iterative manner, to check whether some incoherence exists between sentences. Continuity of a topic discussed in consecutive sentences can be identified by means of certain frequency measures of the constituent words across the sentences. The task is analogous to shot detection in video.

FIG. 1 depicts an overview of the text segmentation system. The main objective of the present invention is to analyze an input text and divide it into cohesive text segments. It is further envisaged to achieve this objective without a lexicon providing information about syntactic and semantic substructures. The text under consideration is a set of sentences. An important first step is tokenization (100). In the tokenization phase, the input sentences are decomposed into tokens that are either words or atomic terms. The noise words are filtered out based on a list of stop-words. This list is customized to exclude pronoun-related words. The second step is stemming (102). In order to eliminate duplicate words, stemming is performed, resulting in the root words. Gramming is done to correct spelling mistakes or errors due to speech-to-text conversion. This is performed by evaluating trigrams, that is, sets of three consecutive characters. The third step is to build the Frequency Matrix (104). The Frequency Matrix consists of the frequencies of tokens in different sentences. The sequence of sentences is maintained in the same order as they occur in the given text. The tokens are clubbed together based on the (syntactic) relationship of words.
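The first three steps can be illustrated with a minimal Python sketch. It is not the implementation of FIG. 1: the stop-word list, the suffix-stripping rule standing in for a stemmer, and the names tokenize and build_frequency_matrix are all assumptions introduced for illustration, and the trigram-based spelling correction is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption). The description says the list
# is customized to exclude pronoun-related words; this sketch reads that as
# "pronouns are kept as tokens", so none appear here.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are", "and", "for"}


def crude_stem(word):
    """Rough suffix stripping that stands in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(sentence):
    """Lower-case, split into words, drop stop-words, and stem what remains."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_frequency_matrix(text):
    """Return (sentences, vocabulary, matrix); matrix[i][k] counts token k in sentence i.

    Sentences are split on periods and kept in their original order.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    token_lists = [tokenize(s) for s in sentences]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    column = {t: k for k, t in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in sentences]
    for i, tokens in enumerate(token_lists):
        for token, count in Counter(tokens).items():
            matrix[i][column[token]] = count
    return sentences, vocabulary, matrix
```

Splitting on every period is the simplest reading of the stated assumption that sentences are separated by periods; it would mis-split abbreviations and decimal numbers, which a practical tokenizer would have to handle.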

Boundary detection involves constructing a graph representation (106) of a given set of sentences with edge weights that depict the syntactic and semantic relationships among sentences. The dependency graph is then analyzed and segmented (108) based on temporal characteristics and, where appropriate, the identified segments are grouped (110) based on spatial characteristics.

FIG. 2 depicts an illustrative input text. Note that the input text is general news-related text and is based on the text available at the site:

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

The sentences in the input text are demarcated by a period.

FIG. 3 depicts an illustrative frequency matrix. The depicted Frequency Matrix is related to the input text depicted in FIG. 2. Please note that only a subset of tokens is provided for illustrative purposes.

FIG. 4 provides an algorithm for dependency graph generation. Graphically, each sentence is represented by a node. The weights on directed edges between the nodes indicate the degree of coherence between the corresponding sentences. These values are derived from the m x n frequency matrix, where m is the number of sentences and n is the number of tokens or filtered words in the given text. Given m sentences, there could be at most (m−1) boundaries. These are initially referred to as candidate boundaries. We assess the strengths of these boundaries B1, B2, . . . Bm−1 by means of linking succeeding sentences with similar tokens. This is based on the neighborhood effect and is defined as (FM(k)*log(M/Dj))/Ti, wherein FM(k) is the frequency of token k in a sentence, M is the total number of sentences, Dj is the distance to a similar token in a subsequent sentence, and Ti is the number of tokens in the sentence. The intuition behind the above formulation is that similar tokens present in neighboring sentences should be given higher weightage than those in sentences that are far apart. This is achieved by incorporating the distance based on the sequence of a sentence in the text. In the above approach, the distance measured by linking forward needs to be unidirectional; otherwise the effect would be reduced. Reverse linking is considered separately, by replacing Dj with the distance to a preceding sentence with the token. Typically, sentences containing anaphoric references are assigned higher weighted links to preceding sentences with the entities, while those with cataphoric references are assigned higher weighted links to the succeeding sentences. Hence, directions are important in distinguishing the segments. The final weight of an edge is computed based on these two linkings, forward and reverse.
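The neighborhood-effect weighting may be sketched in Python as follows. This is an illustrative reading of the formula above, not the algorithm of FIG. 4: the natural logarithm, the averaging of the forward and reverse contributions, and the function names are assumptions, and any further normalization of the edge weights is left out.

```python
import math


def proportional_weight(fm, i, j, k):
    """Contribution of token k to the link from sentence i to sentence j.

    Follows (FM(k) * log(M / D)) / Ti with D = |i - j|; it is zero unless
    both sentences contain the token. The log base is an assumption.
    """
    if i == j or fm[i][k] == 0 or fm[j][k] == 0:
        return 0.0
    total_sentences = len(fm)
    tokens_in_i = sum(fm[i])
    return fm[i][k] * math.log(total_sentences / abs(i - j)) / tokens_in_i


def directed_weight(fm, i, j):
    """Sum of the per-token contributions for the link from i to j."""
    return sum(proportional_weight(fm, i, j, k) for k in range(len(fm[i])))


def edge_weights(fm):
    """Combine forward (i to j) and reverse (j to i) linking for every pair i < j.

    Averaging the two directions is an assumption made for this sketch.
    Pairs that share no token get no edge.
    """
    weights = {}
    for i in range(len(fm)):
        for j in range(i + 1, len(fm)):
            forward = directed_weight(fm, i, j)
            reverse = directed_weight(fm, j, i)
            if forward + reverse > 0:
                weights[(i, j)] = (forward + reverse) / 2
    return weights
```

With three sentences in which every pair shares exactly one token, adjacent sentences receive a larger weight than the pair that is two sentences apart, which is the intended neighborhood effect.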

FIG. 5 depicts an illustrative dependency graph for the input text depicted in FIG. 2. Observe that the edge weights are normalized and that almost the entire set of sentences of the input text appears as a single connected graph.

FIG. 6 provides an algorithm for cohesiveness based graph segmentation. In order to identify segment boundaries, it is required to cut the single connected graph so that multiple segments present in the input text can be determined. The graph segmentation leads directly to input text segmentation as each node in the graph represents an input sentence. While edge weights of a graph play a role in the segmentation process, it is required to assess the cohesiveness of the sentences represented by the graph in order to decide whether the graph needs to be further segmented or not. Successive segmentation of the graph leads to smaller and smaller subgraphs, and finally, the sentences represented by each subgraph that remains form a cohesive text segment.
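The successive segmentation can be sketched in Python as a recursive procedure. This is only one plausible reading of the description, not the algorithm of FIG. 6: cutting the lowest-weight edges one at a time until the graph falls apart is an assumption, as are the function names; the cohesiveness measure is passed in as a caller-supplied function.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, found by depth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        result.append(sorted(group))
    return result


def segment(nodes, edges, cohesiveness, threshold):
    """Split a graph into cohesive subgraphs.

    nodes: sentence indices; edges: {(a, b): weight}.
    cohesiveness: function (nodes, edges) -> float supplied by the caller.
    """
    parts = components(nodes, edges)
    if len(parts) == 1:
        if len(nodes) < 2 or cohesiveness(nodes, edges) >= threshold:
            return [sorted(nodes)]
        # Assumption: drop the weakest edges first until the graph disconnects.
        remaining = dict(edges)
        for edge in sorted(edges, key=edges.get):
            del remaining[edge]
            parts = components(nodes, remaining)
            if len(parts) > 1:
                break
    segments = []
    for part in parts:
        members = set(part)
        inner = {e: w for e, w in edges.items() if e[0] in members and e[1] in members}
        segments.extend(segment(part, inner, cohesiveness, threshold))
    return segments
```

Each recursive call works on a strictly smaller subgraph, so the procedure terminates; edges that lie wholly inside a resulting subgraph are retained so that its cohesiveness is judged on all of its internal links.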

FIG. 6 a provides an algorithm for segment cohesiveness analysis. The assessment of cohesiveness of a graph is based on the notion of the extent of support that each sentence represented by a node of the graph provides for the rest of the sentences represented by the graph. This support is computed based on the shortest path between two nodes in the graph and the edge weights of this shortest path. The cohesiveness is then computed based on the normalized pair-wise overall weight of the shortest weighted path across all of the nodes of the graph.
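A possible Python rendering of this measure is given below. It is a sketch under stated assumptions rather than the algorithm of FIG. 6 a: edge weights are treated directly as path lengths, all-pairs shortest paths are found with the Floyd-Warshall method, and the pair-wise total is normalized by the sum of edge weights and by the number of sentences in the whole input text. The name cohesiveness and its parameters are introduced here for illustration.

```python
import math


def cohesiveness(nodes, edges, total_sentences):
    """Normalized sum of shortest-path weights over all node pairs.

    nodes: sentence indices of the segment; edges: {(a, b): weight};
    total_sentences: number of sentences in the whole input text.
    """
    total_weight = sum(edges.values())
    if total_weight == 0 or total_sentences == 0:
        return 0.0
    position = {n: p for p, n in enumerate(nodes)}
    size = len(nodes)
    dist = [[math.inf] * size for _ in range(size)]
    for p in range(size):
        dist[p][p] = 0.0
    for (a, b), weight in edges.items():
        i, j = position[a], position[b]
        dist[i][j] = dist[j][i] = min(dist[i][j], weight)
    # Floyd-Warshall: relax every pair through every intermediate node.
    for k in range(size):
        for i in range(size):
            for j in range(size):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # Support of each node for the nodes after it; unreachable pairs add nothing.
    support = sum(
        dist[i][j]
        for i in range(size)
        for j in range(i + 1, size)
        if dist[i][j] < math.inf
    )
    return support / total_weight / total_sentences
```

Whether the shortest path should minimize the summed weights, as here, or the number of hops is not settled by the description; the choice affects the value returned and should be checked against FIG. 6 a.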

FIG. 6 b provides an algorithm for segment grouping. The need for segment grouping arises from the observation that there are intersegment relationships that are not based on segmental neighborhood properties. A distinct kind of segment grouping that has practical applications is based on identifying three portions in the text input: Preamble Text Segments (also called header segments), Body Text Segments (also called main segments), and Postamble Text Segments (also called footer segments). The grouping associates a preamble segment with one or more body segments and, similarly, associates a postamble segment with one or more body segments. With reference to FIG. 6 b, preamble identification is based on the observation that the successive segments in the preamble are of similar size (that is, number of sentences in a segment) and differ drastically in size from the segments in the body. A similar distinction holds between postamble segments and body segments, as described in FIG. 6 b. FIG. 6 b also describes spatial merging, in which a preamble or postamble segment is merged with one or more body segments based on the term co-occurrence between the two segments under consideration. Note that as this spatial merging is a special case of general segment merges, an underlying assumption is that a preamble or postamble segment gets merged with at least one body segment.
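The size-based identification and the co-occurrence-based merging can be sketched in Python as follows. The sketch is illustrative and is not the algorithm of FIG. 6 b: the thresholds, the normalization of the co-occurrence count by the smaller token set, the fallback to the best-matching body segment, and all function names are assumptions.

```python
def find_preamble(segments, max_size, max_diff):
    """Count the leading segments treated as preamble.

    The first segment qualifies if it is smaller than max_size; each
    following segment qualifies while its size differs from its
    predecessor's by less than max_diff.
    """
    if not segments or len(segments[0]) >= max_size:
        return 0
    count = 1
    while (count < len(segments)
           and abs(len(segments[count]) - len(segments[count - 1])) < max_diff):
        count += 1
    return count


def find_postamble(segments, max_size, max_diff):
    """Same test applied from the end of the text backwards."""
    return find_preamble(segments[::-1], max_size, max_diff)


def cooccurrence(tokens_a, tokens_b):
    """Shared distinct tokens, normalized by the smaller token set (assumption)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def merge_targets(edge_tokens, body_token_lists, threshold):
    """Indices of the body segments a preamble or postamble segment merges with.

    If no score exceeds the threshold, the best-scoring body segment is
    chosen so that at least one merge always happens.
    """
    scores = [cooccurrence(edge_tokens, body) for body in body_token_lists]
    chosen = [i for i, score in enumerate(scores) if score > threshold]
    if not chosen and scores:
        chosen = [max(range(len(scores)), key=scores.__getitem__)]
    return chosen
```

The fallback in merge_targets reflects the stated assumption that every preamble or postamble segment is merged with at least one body segment. The sketch does not guard against preamble and postamble ranges overlapping when all segments are of similar size.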

FIG. 7 depicts illustrative graph segments. Observe how the header, main, and footer segments stand out after the process of segmentation. One of the header segments involves sentences 1 and 2, and sentences 8 through 13 form a main segment.

FIG. 8 depicts illustrative text segments. In the illustration, header, main, and footer segments are also indicated.

FIG. 9 depicts illustrative merged text segments. Note that a header segment, Segment H1, is merged with a main segment, Segment A. Similarly, Segment H2 is merged with Segment B, Segment F1 is merged with Segment A, and Segment F2 is merged with Segment C.

Thus, a system and method for text based analysis of automated speech recognizer transcripts is disclosed. Although the present invention has been described particularly with reference to the figures, it will be apparent to one of ordinary skill in the art that the present invention may appear in any number of systems that perform textual analysis for segmentation and segment merging. It is further contemplated that many changes and modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the present invention.

1. A text segmentation system, TSS, for segmenting a plurality of sentences, said system comprising: (a) Dependency Graph Construction Element for determining a weighted-graph based on bi-directional analysis of said plurality of sentences; (b) Graph Segmentation Element for determining a plurality of sentence segments of said plurality of sentences based on segment cohesiveness; and (c) Graph Merging Element for grouping said plurality of sentence segments.
2. The system of claim 1, wherein said Dependency Graph Construction Element comprises a procedure to compute proportional edge weight between two nodes (first node and second node) of a graph with respect to a token K, wherein said first node is associated with sentence I of said plurality of sentences and said second node is associated with sentence J of said plurality of sentences, said computing comprises: computing FMIK as the number of occurrences of said token K in said sentence I, computing FMJK as the number of occurrences of said token K in said sentence J, computing M as the number of sentences in said plurality of sentences, computing TI as the number of tokens in said sentence I, and computing proportional edge weight PEIJ as FMIK*LOG(M*|I−J|)/TI if FMJK is >0.
3. The system of claim 2, wherein said Dependency Graph Construction Element further comprises a procedure to compute proportional forward edge weight with respect to a node I of a graph, a token K, and a plurality of nodes of said graph, said computing comprises: computing sum of proportional edge weight with respect to said node I and each of node J of said plurality of nodes of said graph, wherein J is >I.
4. The system of claim 2, wherein said Dependency Graph Construction Element further comprises a procedure to compute proportional reverse edge weight with respect to a node I of a graph, a token K, and a plurality of nodes of said graph, said computing comprises: computing sum of proportional edge weight with respect to said node I and each of node J of said plurality of nodes of said graph, wherein J is <I.
5. The system of claim 2, wherein said Dependency Graph Construction Element further comprises a procedure to compute forward edge weight with respect to a node I of a graph, said computing comprises: computing sum of proportional forward edge weight with respect to each of token K of sentence I associated with said node I, and a plurality of nodes of said graph.
6. The system of claim 2, wherein said Dependency Graph Construction Element further comprises a procedure to compute reverse edge weight with respect to a node I of a graph, said computing comprises: computing sum of proportional reverse edge weight with respect to each of token K of sentence I associated with said node I, and a plurality of nodes of said graph.
7. The system of claim 2, wherein said Dependency Graph Construction Element further comprises a procedure to compute edge weight between a node I and a node J of a graph, said computing comprises: computing FIJ as forward edge weight between said node I and said node J, computing RIJ as reverse edge weight between said node I and said node J, computing M as the number of sentences in said plurality of sentences, computing MX as the maximum number of tokens in any sentence of said plurality of sentences, computing MN as the minimum number of tokens in any sentence of said plurality of sentences, computing MIN as LOG(M/(M−1))/MX, computing MAX as LOG(M)/MN, and computing edge weight between said node I and said node J as (FIJ+RIJ)/(2*(MAX−MIN)).
8. The system of claim 1, wherein said Graph Segmentation Element comprises a procedure to compute cohesiveness of a sentence segment of said plurality of sentence segments, wherein said computing comprises: determining G as a graph associated with said sentence segment, computing M as the number of sentences in said plurality of sentences, computing TW as the sum of edge weights of said G, computing W as the sum of edge weights of the weighted shortest path between a vertex of said G and another vertex of said G, computing RS as the sum of W between vertex I of said G and each of vertex J (>I) of said G, computing TS as sum of RS associated with each vertex V of said G divided by TW, and computing cohesiveness of said G as TS divided by M.
9. The system of claim 8, wherein said Graph Segmentation Element further comprises a procedure to segment a sentence segment of said plurality of sentence segments, wherein said segmenting comprises: computing the cohesiveness of said sentence segment, determining a plurality of edges of a graph associated with said sentence segment with least edge weights, and partitioning said graph based on said plurality of edges if the cohesiveness of said graph is < a predefined threshold.
10. The system of claim 1, wherein said Graph Merging Element comprises a procedure to determine a plurality of preamble segments of said plurality of sentence segments, wherein said determining comprises: determining the first segment of said plurality of segments and making the first segment part of said plurality of preamble segments if the size of the first segment is less than a predefined threshold, determining the last segment PS1 of said plurality of preamble segments, determining SIZE1 as the number of sentences in said last segment PS1, determining a segment PS2 of said plurality of segments that is successor to said segment PS1, determining SIZE2 as the number of sentences in said segment PS2, and making said segment PS2 part of said plurality of preamble segments if |SIZE2−SIZE1| < a predefined threshold.
11. The system of claim 10, wherein said Graph Merging Element further comprises a procedure to determine a plurality of postamble segments of said plurality of sentence segments, wherein said determining comprises: determining the last segment of said plurality of segments and making the last segment part of said plurality of postamble segments if the size of said last segment is less than a predefined threshold, determining the first segment PS1 of said plurality of postamble segments, determining SIZE1 as the number of sentences in said segment PS1, determining a segment PS2 of said plurality of segments that is predecessor to said segment PS1, determining SIZE2 as the number of sentences in said segment PS2, and making said segment PS2 part of said plurality of postamble segments if |SIZE2−SIZE1| < a predefined threshold.
12. The system of claim 10, wherein said Graph Merging Element further comprises a procedure to merge a preamble segment with a sentence segment of said plurality of sentence segments, wherein said merging comprises: computing co-occurrence frequency count of said preamble segment with said sentence segment of said plurality of sentence segments, and merging said preamble segment with said sentence segment of said plurality of sentence segments if the normalized said frequency count exceeds a predefined threshold.
13. The system of claim 10, wherein said Graph Merging Element comprises a procedure to merge a postamble segment with a sentence segment of said plurality of sentence segments, wherein said merging comprises: computing co-occurrence frequency count of said postamble segment with said sentence segment of said plurality of sentence segments, and merging said postamble segment with said sentence segment of said plurality of sentence segments if the normalized said frequency count exceeds a predefined threshold.